Real-Time Ray Tracing with GPGPU and Multi-Core


This page is a continuation of previous work, Real-Time Ray Tracing on the Playstation 3 Cell Processor. The ray-tracing test case is identical. The code has been modified to utilize the NVIDIA GPU, and multiple cores using Posix threads.


In Four-Dimensional Cellular Automata Acceleration Using GPGPU I previously experimented with GPGPU (General Purpose computation on Graphics Processing Units). This programming was difficult because it involved essentially "fooling" OpenGL shaders into doing non-graphics computations. All programing was done through the OpenGL graphics API. With CUDA NVIDIA has provided a non-graphics API for doing general purpose programming on their 8 series GPUs. It is supported under Windows XP and Linux; drivers and SDK are available here.

While my example is graphical, only the SDL frame buffer capability of the GPU is being used. All object rendering computations are done by my CUDA code running on the GPU -- the card doesn't know this is a graphical computation.

The NVIDIA GPU has multiple multiprocessors: 12 on the GeForce 8800 GTS I tested. Each multiprocessor has 8 processors, and with them each multiprocessor can support up to 768 concurrent threads.

Under CUDA the programmer writes a kernel which is the code executed by a single thread. Threads are placed in two or three-dimensional groupings called blocks, and blocks in turn are placed in two-dimensional groupings called grids. The run-time system schedules blocks and grids to run as resources are available. The general computational model is SIMD: the same kernel instructions are executed by all threads. The NVIDIA C compiler supports fairly standard C with branches, loops, etc., but no recursion.

Like the SPUs on the PS3 Cell processor, GPU memory is limited and access to host memory is slow. Data is explicitly moved from the host to the device by the host program. Here data is in device global memory (shared between all multiprocessors), which is still quite slow. The kernel threads can copy data from global memory to shared memory (shared between processors in a single multiprocessor), though shared memory is limited to 16K per multiprocessor.

My Program on CUDA

The previous ray-tracing code has been split into host and device kernel portions. The device kernel is executed by each thread. The threads are arranged into 16x16 blocks, and the blocks in turn form a 64x35 grid. Each thread processes a single pixel of the 1024x564 display. At the beginning of each frame the current world state structure is copied by the host to the global memory of the device. The 0,0 thread in each block in turn copies this world state structure into shared memory to provide faster access for all the threads in its block. Note this shared memory is read-only from the threads point of view -- they never update it. At the end of each thread's computation it writes its results (a single pixel color) to another device global array. When all threads have completed the host program reads the device global array and writes it to the SDL frame buffer.

With this design CUDA is not a practical way to implement a ray-tracer. The 16K shared memory restriction makes realistic world models impractical. It appears the number of threads which can be started simultaneously is constrained by the size of the world state structure:

Here we see that with a small number of objects in the world the program runs faster with shared memory. With a large number of objects in the world it is actually faster without shared memory (and shared memory cannot fit more than 225 objects). The Shared memory speedup is independent of whether all the objects are actually rendered. The program bottleneck is not in looping over the objects, but in the speed of access to object data and the number of threads which can be simultaneously started.

Pthreads on Intel Quad-Core

The generic ray-tracer code has been modified to use pthreads. At the beginning of each frame 4 threads are created and given starting screen line numbers. Like the PS3 cell ray tracer, the first thread will render the 1,5,9, etc. line, while the second thread renders 2, 6, 10, etc. The main program does a join waiting for all the threads to complete. The program does not scale linearly with the number of CPUs, indicating it may be bottle-necking on writing the results to main memory. It might run faster if it batched its writes like the PS3 version.


The new code is available here

Below shows CPU and memory utilization during NVIDIA CUDA GPGPU, generic 1 pthread, and generic 4 pthreads tests. In the CUDA test my ray program used 92% of one CPU and the Xorg X-windows server used 12% of another CPU.

Update (7/2007)

I obtained an evaluation copy of the Intel C++ Compiler 10.0, Professional Edition, for Linux. I was curious to try their SSE auto-vectorization code. With 4 threads it was 21% faster than GCC.

Update (12/2007)

Intel C++ supports OpenMP, a standardized API for shared-memory multiprocessing in C++. By adding a simple #pragma loops can be split up to automatically run in parallel on multiple cores. I added a simple change to the per-screen-line rendering loop of the generic ray tracer:
    int screenX;
#ifdef USE_OMP
#pragma omp parallel for firstprivate(portPoint)
    for (screenX = 0;
         screenX < SCREEN_WIDTH;
         screenX++) {
With this change each pixel is rendered by a separate thread, with a maximum of 4 (the number of cores) running simultaneously. This use of threads is an alternative to the explicit pthread threading tested above, where a separate thread is used per line of the screen. In this case OpenMP is simpler than explicit threading, though it yields slightly lower performance.

GCC 4.2 also supports OpenMP, but its performance is much worse than Intel.

Update (1/2009)

I've ported the generic and CUDA ray tracers to Mac OS X running on a MacBook Pro laptop. It has a 2.4GHz Intel Core 2 Duo processor and a NVIDIA GeForce 9600M GT graphics processor. The Performance for 1 and 2 software threads is comparable to my older Linux desktop, and the CUDA is within a factor of 3 of the desktop graphics card.
Threads    Frames Per Second
1           8.5
2          17.4
GPU        16.7

Update 10/2010

I've re-run the generic ray tracer program on a 6 core processor with hyper-threading. The Intel i7-980X supports two hardware threads per core, so the OS sees it as a 12-core processor.
Intel(R) Core(TM) i7 CPU       X 980  @ 3.33GHz
Linux hex 2.6.35-22-generic #35-Ubuntu SMP Sat Oct 16 20:45:36 UTC 2010 x86_64 GNU/Linux
Ubuntu 10.10

This testing was paired with Go ACO TSP.

Here hyper-threading has improved performance for both programs in all configurations. At one core both programs were sped up 28%. At 6 cores ray tracing was sped up 9% versus non-HT, while ACO TSP was sped up 18% versus non-HT. The higher speedup for ACO TSP is expected as it matches the general speedup from increased cores.

Both programs were run with 24 threads in all test configurations. Having more program threads than cores*hardware_threads_per_core is key. Otherwise the OS may schedule multiple threads on the same core while leaving others idle. Linux is hyper-theading aware and attempts to avoid this scheduling problem, but doesn't always get it right. In testing with 6 threads and 6 cores, I saw 28% decreases in performance for both programs when hyper-threading was enabled.

The Intel i7-980X also supports "Turbo Boost", where one core is automatically overclocked when other cores are idle. Turbo boost was disabled for the tests above. With one core and HT disabled, Turbo boost provides a speedup of 8% for both programs. Surprisingly, with six cores and HT enabled (so all cores should be fully utilized) Turbo boost provides a speedup of 5% for both programs.

Update 11/2010

OpenCL is the new open standard for GPGPU (general-purpose computing on graphics processing units) programming. It supports both graphics cards and multi-core processors with the same C99-based source code. I've re-written the NVIDIA CUDA version of my real-time ray tracer using OpenCL, and run it on an ATI/AMD Radeon HD 5870 GPU. For comparison I also ran it in CPU mode on the Intel i7-980X (6 core, hyperthreaded) processor.

The OpenCL CPU version provided similar results to the pthreads-generic version tested earlier. With 10 objects in the scene the GPU performance was only slightly better than the CPU version. But as the number of objects increased the GPU version maintained performance much better than the CPU version. This indicates that with low object counts the actual work done by each GPU core is small relative to the setup and copying results back overheads. Yes, a realistic ray tracer would be culling the scene graph so fewer objects would be evaluated for each point. And the GPU version ran out of constant memory above 800 objects. But this is much better scaling and performance than the NVIDIA CUDA test I did a few years ago.